Abstract. The self-concordant-like property of a smooth convex function
is a new analytical structure that generalizes the self-concordant
notion. While a wide variety of important applications feature the
self-concordant-like property, this concept has heretofore remained
unexploited in convex optimization. To this end, we develop a variable metric
framework for minimizing the sum of a “simple” convex function and a
self-concordant-like function. We introduce a new analytic step-size
selection procedure and prove that the basic gradient algorithm has improved
convergence guarantees as compared to “fast” algorithms that rely on
the Lipschitz gradient property. Our numerical tests with real data sets
show that the practice indeed follows the theory.


1 Introduction 

In this paper, we consider the following composite convex minimization problem:

$$ F^* := \min_{x \in \mathbb{R}^n} \{ F(x) := f(x) + g(x) \}, \tag{1} $$

where $f$ is a nonlinear smooth convex function, while $g$ is a “simple” possibly
nonsmooth convex function. Such composite convex problems naturally arise in
many applications of machine learning, data sciences, and imaging science. Very
often, $f$ measures a data fidelity or a loss function, and $g$ encodes a form of
low-dimensionality, such as sparsity or low-rankness.

To trade off accuracy and computation optimally in large-scale instances of
(1), existing optimization methods invariably invoke the additional assumption
that the smooth function $f$ also has an $L$-Lipschitz continuous gradient (cf. [11]
for the definition). A highlight is the recent development of proximal gradient
methods, which feature (nearly) dimension-independent, global sublinear
convergence rates [?]. When the smooth $f$ in (1) also has strong regularity [15],
the problem (1) is also within the theoretical and practical grasp of
proximal-(quasi) Newton algorithms with linear, superlinear, and quadratic
convergence rates [?]. These algorithms specifically exploit second order
information or its principled approximations (e.g., via BFGS or L-BFGS updates [?]).

In this paper, we do away with the Lipschitz gradient assumption and instead
focus on another structural assumption on $f$ in developing an algorithmic
framework for (1), which is defined below.

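Definition 1. A convex function $f \in C^3(\mathbb{R}^n)$ is called a self-concordant-like
function, $f \in \mathcal{F}_{scl}$, if:

$$ |\varphi'''(t)| \le M_f \varphi''(t) \|u\|_2, \tag{2} $$

for $t \in \mathbb{R}$ and $M_f > 0$, where $\varphi(t) := f(x + tu)$ for any $x \in \mathrm{dom}(f)$ and $u \in \mathbb{R}^n$.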
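
As a small numerical illustration of Definition 1 (a sketch added for clarity, not part of the original analysis), consider a single logistic loss term $f(x) = \log(1 + e^{-y\langle w, x\rangle})$ with $y \in \{-1, 1\}$. Its directional derivatives are available in closed form, and inequality (2) should hold with $M_f = \|w\|_2$. The function names and the random sampling scheme below are assumptions made for this example only.

```python
import math
import random


def logistic_directional_derivatives(w, y, x, u):
    """Return (phi''(0), phi'''(0)) for phi(t) = log(1 + exp(-y * <w, x + t*u>))."""
    wx = sum(wi * xi for wi, xi in zip(w, x))
    wu = sum(wi * ui for wi, ui in zip(w, u))
    s = y * wx                        # margin at t = 0
    c = y * wu                        # derivative of the margin along u
    sig = 1.0 / (1.0 + math.exp(-s))  # sigmoid of the margin
    curv = sig * (1.0 - sig)          # second derivative of log(1 + exp(-s))
    d2 = curv * c ** 2
    d3 = curv * (1.0 - 2.0 * sig) * c ** 3
    return d2, d3


def check_scl_logistic(num_trials=1000, dim=5, seed=0):
    """Check |phi'''| <= M_f * phi'' * ||u||_2 with M_f = ||w||_2 on random samples."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        w = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
        x = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
        u = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
        y = rng.choice([-1.0, 1.0])
        d2, d3 = logistic_directional_derivatives(w, y, x, u)
        m_f = math.sqrt(sum(v * v for v in w))
        norm_u = math.sqrt(sum(v * v for v in u))
        if abs(d3) > m_f * d2 * norm_u + 1e-12:
            return False
    return True
```

Because $x$ is arbitrary, checking the inequality at $t = 0$ covers every point on the line $x + tu$.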
Definition 1 mimics the standard self-concordance concept ([10, Definition
4.1.1]) and was first discussed in [1] for model consistency in logistic
regression. For composite convex minimization, self-concordant-like functions abound
in machine learning, including but not limited to logistic regression, multinomial
logistic regression, conditional random fields, and robust regression (cf. the
references in [?]). In addition, special instances of geometric programming [6] can
also be recast as (1) where $f \in \mathcal{F}_{scl}$.

The importance of the assumption $f \in \mathcal{F}_{scl}$ in (1) is twofold. First, it
enables us to derive an explicit step-size selection strategy for proximal variable
metric methods, enhancing backtracking-line search operations with improved
theoretical convergence guarantees. For instance, we can prove that our
proximal gradient method can automatically adapt to the local strong convexity of $f$
near the optimal solution to feature linear convergence under mild conditions.
This theoretical result is backed up by great empirical performance on real-life
problems where the fast Lipschitz-based methods actually exhibit sublinear
convergence (cf. Section 4). Second, the self-concordant-like assumption on $f$ also
helps us provide scalable numerical solutions of (1) for specific problems where $f$
does not have Lipschitz continuous gradient, such as special forms of geometric
programming problems.

Contributions. Our specific contributions can be summarized as follows: 

1. We propose a new variable metric framework for minimizing the sum $f + g$ of
a self-concordant-like function $f$ and a convex, possibly nonsmooth function
$g$. Our approach relies on the solution of a convex subproblem obtained by
linearizing and regularizing the first term $f$, and uses an analytical step-size
to achieve descent in three classes of algorithms: first order methods, second
order methods, and quasi-Newton methods.

2. We establish both the global and the local convergence of different variable 
metric strategies. We pay particular attention to diagonal variable metrics 
since in this case many of the proximal subproblems can be solved exactly. 
We derive conditions on when and where these variants achieve locally linear 
convergence. When the variable metric is the Hessian of $f$ at each iteration,
we show that the resulting algorithm locally exhibits quadratic convergence 
without requiring any globalization strategy such as a backtracking line- 
search. 

3. We apply our algorithms to large-scale real-world and synthetic problems to 
highlight the strengths and the weaknesses of our variable-metric scheme. 

Relation to prior work. Many of the composite problems with self-concordant-like
$f$, such as regularized logistics and multinomial logistics, also have Lipschitz
continuous gradient. In those specific instances, many theoretically efficient
algorithms are applicable [?]. Compared to these works, our framework
has theoretically stronger local convergence guarantees thanks to the specific
step-size strategy matched with $f \in \mathcal{F}_{scl}$. The authors of [18] consider composite
problems where $f$ is standard self-concordant and propose a proximal Newton
algorithm optimally exploiting this structure. Our structural assumptions and
algorithmic emphasis here are different.

Paper organization. We first introduce the basic definitions and optimality
conditions before deriving the variable metric strategy in Section 2. Section 3
proposes our new variable metric framework, describes its step-size selection
procedure, and establishes the convergence theory of its variants. Section 4
illustrates our framework on real and synthetic data.

2 Preliminaries 

We adapt the notion of self-concordant functions in [?] to a different smooth
function class. Then we present the optimality condition of problem (1).

2.1 Basic definitions 

Let $g : \mathbb{R}^n \to \mathbb{R}$ be a proper, lower semicontinuous convex function [?] and
$\mathrm{dom}(g)$ denote the domain of $g$. We use $\partial g(x)$ to denote the subdifferential of $g$
at $x \in \mathrm{dom}(g)$ if $g$ is nondifferentiable at $x$, and $\nabla g(x)$ to denote its gradient
otherwise. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a $C^3(\mathrm{dom}(f))$ function (i.e., $f$ is three times
continuously differentiable). We denote by $\nabla f(x)$ and $\nabla^2 f(x)$ the gradient and
the Hessian of $f$ at $x$, respectively. Suppose that, for a given $x \in \mathrm{dom}(f)$,
$\nabla^2 f(x)$ is positive definite (i.e., $\nabla^2 f(x) \in \mathcal{S}^n_{++}$). Then we define the local norm of a
given vector $u \in \mathbb{R}^n$ as $\|u\|_x := [u^T \nabla^2 f(x) u]^{1/2}$. The corresponding dual norm
of $u$, $\|u\|_x^*$, is defined as $\|u\|_x^* := \max\{u^T v \mid \|v\|_x \le 1\} = [u^T \nabla^2 f(x)^{-1} u]^{1/2}$.

2.2 Composite self-concordant-like minimization 

Let $f \in \mathcal{F}_{scl}(\mathbb{R}^n)$ and $g$ be proper, closed and convex. The optimality condition
for (1) can be concisely written as follows:

$$ 0 \in \nabla f(x^*) + \partial g(x^*). \tag{3} $$

Let us denote by $x^*$ an optimal solution of (1). Then, the condition (3) is
necessary and sufficient. We also say that $x^*$ is nonsingular if $\nabla^2 f(x^*)$ is positive
definite. We now establish the existence and uniqueness of the solution $x^*$ of (1),
whose proof can be found in the appendix.

Lemma 1. Suppose that $f \in \mathcal{F}_{scl}(\mathbb{R}^n)$ satisfies Definition 1 for some $M_f > 0$.
Suppose further that $\nabla^2 f(x) \succ 0$ for some $x \in \mathrm{dom}(f)$. Then the solution $x^*$ of
(1) exists and is unique.

For a given symmetric positive definite matrix $H$, we define a generalized
proximal operator $\mathrm{prox}_{H^{-1}g}$ as:

$$ \mathrm{prox}_{H^{-1}g}(x) := \arg\min_{z} \{ g(z) + (1/2) \|z - x\|_H^2 \}, \quad \text{where } \|u\|_H := \langle Hu, u \rangle^{1/2}. \tag{4} $$

Due to the convexity of $g$, this operator is well-defined and single-valued. If we
can compute $\mathrm{prox}_{H^{-1}g}$ efficiently (e.g., by a closed form or by polynomial time
algorithms), then we say that $g$ is proximally tractable. Examples of proximally
tractable convex functions can be found, e.g., in [14]. Using $\mathrm{prox}_{H^{-1}g}$, we can
write condition (3) as:

$$ x^* - H^{-1} \nabla f(x^*) \in (I + H^{-1} \partial g)(x^*) \iff x^* = \mathrm{prox}_{H^{-1}g}(x^* - H^{-1} \nabla f(x^*)). $$

This expression shows that $x^*$ is a fixed point of $\mathcal{R}_H(\cdot) := \mathrm{prox}_{H^{-1}g}((\cdot) - H^{-1} \nabla f(\cdot))$.
Based on the fixed point principle, one can expect that the iterative
sequence $\{x^k\}_{k \ge 0}$ generated by $x^{k+1} := \mathcal{R}_H(x^k)$ converges to $x^*$. This
observation is made rigorous below.
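
To make the scaled proximal operator concrete, the following minimal sketch treats the case where $H = \mathrm{diag}(h)$ is diagonal and $g = \rho\|\cdot\|_1$, for which the minimization in (4) separates by coordinate and reduces to soft-thresholding with thresholds $\rho/h_i$. The function names `prox_l1_diag` and `fixed_point_map`, and the choice of $g$, are assumptions made for illustration.

```python
def prox_l1_diag(x, h, rho):
    """argmin_z rho*||z||_1 + 0.5 * sum_i h[i] * (z[i] - x[i])**2, with all h[i] > 0."""
    out = []
    for xi, hi in zip(x, h):
        thr = rho / hi  # coordinate-wise threshold induced by the metric
        if xi > thr:
            out.append(xi - thr)
        elif xi < -thr:
            out.append(xi + thr)
        else:
            out.append(0.0)
    return out


def fixed_point_map(x, grad_f, h, rho):
    """One application of R_H(x) = prox_{H^{-1} g}(x - H^{-1} grad_f(x)) for diagonal H."""
    g = grad_f(x)
    shifted = [xi - gi / hi for xi, gi, hi in zip(x, g, h)]
    return prox_l1_diag(shifted, h, rho)
```

For instance, with $f(x) = \tfrac12\|x - b\|_2^2$ and $h_i = 1$, a single application of the map already returns the soft-thresholded vector $b$, which is then left unchanged by further applications.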
3 Our variable metric framework

We first present a generic variable metric proximal framework for solving (1).
Then, we specify this framework to obtain three variants: proximal gradient,
proximal Newton and proximal quasi-Newton algorithms.

3.1 Generic variable metric proximal algorithmic framework 

Given $x^k \in \mathrm{dom}(F)$ and an appropriate choice $H_k \in \mathcal{S}^n_{++}$, since $f \in \mathcal{F}_{scl}$, one
can approximate $f$ at $x^k$ by the following quadratic model:

$$ Q_{H_k}(x, x^k) := f(x^k) + \langle \nabla f(x^k), x - x^k \rangle + \tfrac{1}{2} \langle H_k(x - x^k), x - x^k \rangle. \tag{5} $$

Our algorithmic approach uses the variable metric forward-backward framework
to generate a sequence $\{x^k\}_{k \ge 0}$ starting from $x^0 \in \mathrm{dom}(F)$ via the update:

$$ x^{k+1} := x^k + \alpha_k d^k, \tag{6} $$

where $\alpha_k \in (0, 1]$ is a given step-size and $d^k$ is a search direction defined by:

$$ d^k := s^k - x^k, \quad \text{with} \quad s^k := \arg\min_{x} \{ Q_{H_k}(x, x^k) + g(x) \}. \tag{7} $$

In the rest of this section, we explain how to determine the step size $\alpha_k$ in the
iterative scheme (6) optimally for special cases of $H_k$. For this, we need the
following definitions:

$$ \lambda_k := \|d^k\|_{x^k}, \quad r_k := M_f \|d^k\|_2, \quad \text{and} \quad \beta_k := \|d^k\|_{H_k} = \langle H_k d^k, d^k \rangle^{1/2}. \tag{8} $$


3.2 Proximal-gradient algorithm 


When the variable matrix $H_k$ is diagonal and $g$ is proximally tractable, we can
efficiently obtain the solution of the subproblem (7) in a distributed fashion or
even in a closed form. Hence, we consider $H_k = D_k := \mathrm{diag}(D_{k,1}, \dots, D_{k,n})$
with $D_{k,i} > 0$ for $i = 1, \dots, n$. Lemma 2, whose proof is in the appendix,
provides a step-size selection procedure and proves the global convergence of
this proximal-gradient algorithm.


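Lemma 2. Let $\{x^k\}_{k \ge 0}$ be a sequence generated by (6) and (7) starting from
$x^0 \in \mathrm{dom}(F)$. For $\lambda_k$, $r_k$ and $\beta_k$ defined by (8), we consider the step-size $\alpha_k$
as:

$$ \alpha_k := \frac{1}{r_k} \ln\left(1 + \frac{\beta_k^2 r_k}{\lambda_k^2}\right). \tag{9} $$

If $\beta_k^2 r_k \le (e^{r_k} - 1)\lambda_k^2$, then $\alpha_k \in (0, 1]$ and:

$$ F(x^{k+1}) \le F(x^k) - \frac{\beta_k^2}{r_k} \left[ \left(1 + \frac{\lambda_k^2}{r_k \beta_k^2}\right) \ln\left(1 + \frac{r_k \beta_k^2}{\lambda_k^2}\right) - 1 \right]. \tag{10} $$

Moreover, this step-size $\alpha_k$ is optimal (w.r.t. the worst-case performance).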
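
The step-size test of Lemma 2 only involves three scalars. The sketch below assumes that the step size has the form $\alpha_k = r_k^{-1}\ln(1 + \beta_k^2 r_k/\lambda_k^2)$, which is the expression appearing in Step 5 of Algorithm 1 and which reduces to the rule of Lemma 3 when $\beta_k = \lambda_k$; the decrease it returns is the value obtained by minimizing the corresponding worst-case upper bound over $\alpha$. The function name `analytic_step_size` is hypothetical.

```python
import math


def analytic_step_size(beta, lam, r):
    """Return (alpha, guaranteed_decrease), or None when the admissibility test fails.

    beta = ||d||_{H_k}, lam = ||d||_{x^k}, r = M_f * ||d||_2, all assumed positive.
    """
    if beta <= 0.0 or lam <= 0.0 or r <= 0.0:
        raise ValueError("beta, lam and r must be positive")
    # Admissibility test: beta^2 * r <= (exp(r) - 1) * lam^2, i.e. alpha <= 1.
    if beta ** 2 * r > math.expm1(r) * lam ** 2:
        return None
    ratio = r * beta ** 2 / lam ** 2
    alpha = math.log1p(ratio) / r
    decrease = (beta ** 2 / r) * ((1.0 + 1.0 / ratio) * math.log1p(ratio) - 1.0)
    return alpha, decrease
```

When the test fails, the caller is expected to modify the metric and recompute the direction rather than take a step.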
Algorithm 1 (Proximal-gradient algorithm with a diagonal variable metric)

Initialization: Given $x^0 \in \mathrm{dom}(F)$, and a tolerance $\varepsilon > 0$.

for $k = 0$ to $k_{\max}$ do

1. Choose $D_k \in \mathcal{S}^n_{++}$ (e.g., using $D_k := L_k I$, where $L_k$ is given by (11)).

2. Compute the proximal-gradient search direction $d^k$ as (7).

3. Compute $\beta_k := \|d^k\|_{D_k}$, $r_k := M_f \|d^k\|_2$ and $\lambda_k := \|d^k\|_{x^k}$.

4. If $\beta_k \le \varepsilon$ then terminate.

5. If $\beta_k^2 r_k \le (e^{r_k} - 1)\lambda_k^2$, then compute $\alpha_k := \frac{1}{r_k} \ln\left(1 + \frac{\beta_k^2 r_k}{\lambda_k^2}\right)$ and update
$x^{k+1} := x^k + \alpha_k d^k$. Otherwise, set $x^{k+1} := x^k$ and update $D_{k+1}$ from $D_k$.

end for


By our condition, the second term on the right-hand side of (10) is always
positive, establishing that the sequence $\{F(x^k)\}$ is decreasing. Moreover, as
$e^{r_k} - 1 \ge r_k$, the condition $\beta_k^2 r_k \le (e^{r_k} - 1)\lambda_k^2$ is implied by the simpler
condition $\beta_k \le \lambda_k$. It is easy to verify that this is satisfied whenever $D_k \preceq \nabla^2 f(x^k)$. In such cases,
our step-size selection ensures the best decrease of the objective value regarding
the self-concordant-like structure of $f$ (and not the actual objective instance).
When $\beta_k > \lambda_k$, we scale down $D_k$ until $\beta_k \le \lambda_k$. It is easy to prove that the
number of backtracking steps needed to find such a $D_k$ is bounded by a constant.
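
The steps of Algorithm 1 can be sketched on a small $\ell_1$-regularized logistic regression problem, $f(x) = N^{-1}\sum_j \log(1 + e^{-y_j\langle w^{(j)}, x\rangle})$ and $g = \rho\|\cdot\|_1$, with $D_k = L_k I$ and $M_f = \max_j \|w^{(j)}\|_2$. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: the step size is assumed to be $\alpha_k = r_k^{-1}\ln(1 + \beta_k^2 r_k/\lambda_k^2)$, a failed test halves $L_k$ (one possible way to "scale down" $D_k$), and all names are hypothetical. Note that $\lambda_k^2$ is computed from inner products $\langle w^{(j)}, d^k\rangle$ only, without forming the Hessian.

```python
import math


def soft_threshold(v, thr):
    return math.copysign(max(abs(v) - thr, 0.0), v)


def prox_gradient_scl(W, y, rho, x0, L0=1.0, eps=1e-8, max_iter=500):
    """Sketch of Algorithm 1 for l1-regularized logistic regression with D_k = L*I."""
    N, n = len(W), len(x0)
    M_f = max(math.sqrt(sum(v * v for v in w)) for w in W)
    x, L = list(x0), L0
    history = []  # objective value at the start of every iteration
    for _ in range(max_iter):
        margins = [yj * sum(wi * xi for wi, xi in zip(w, x)) for w, yj in zip(W, y)]
        sig = [1.0 / (1.0 + math.exp(-m)) for m in margins]
        F = sum(math.log1p(math.exp(-m)) for m in margins) / N
        F += rho * sum(abs(v) for v in x)
        history.append(F)
        grad = [sum(-(1.0 - s) * yj * w[i] for w, yj, s in zip(W, y, sig)) / N
                for i in range(n)]
        # Step 2: with D_k = L*I the subproblem reduces to soft-thresholding.
        d = [soft_threshold(xi - gi / L, rho / L) - xi for xi, gi in zip(x, grad)]
        # Step 3: beta_k^2, r_k and lambda_k^2 (Hessian quadratic form via <w_j, d>).
        norm_d = math.sqrt(sum(v * v for v in d))
        beta2 = L * norm_d ** 2
        if math.sqrt(beta2) <= eps:  # Step 4: stopping test
            break
        r = M_f * norm_d
        lam2 = sum(s * (1.0 - s) * sum(wi * di for wi, di in zip(w, d)) ** 2
                   for w, s in zip(W, sig)) / N
        # Step 5: alpha <= 1 is equivalent to beta^2 * r <= (exp(r) - 1) * lam^2.
        alpha = math.log1p(beta2 * r / lam2) / r if lam2 > 0.0 else float("inf")
        if alpha <= 1.0:
            x = [xi + alpha * di for xi, di in zip(x, d)]
        else:
            L *= 0.5  # assumption: scale the metric down and retry from the same x
    return x, history
```

Testing `alpha <= 1` instead of evaluating $e^{r_k} - 1$ directly avoids overflow when $r_k$ is large; the two tests are equivalent.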

Using our step-size (9), we obtain the proximal-gradient algorithm for solving
(1) described in Algorithm 1. The main step in Algorithm 1 is to compute the search
direction $d^k$ at Step 2, which is equivalent to the solution of the convex
subproblem (7). The second main step is to compute $\lambda_k = \langle \nabla^2 f(x^k) d^k, d^k \rangle^{1/2}$.
This quantity requires the product of the Hessian $\nabla^2 f(x^k)$ of $f$ and $d^k$, but not the
full Hessian. It is clear that if $\beta_k = 0$ then $d^k = 0$ and $x^{k+1} = x^k$, and we obtain
the solution of (1), i.e., $x^k = x^*$. The diagonal matrix $D_k$ can be updated as
$D_{k+1} := cD_k$ for a given factor $c > 1$.

We now explain how the new theory enhances the standard backtracking
linesearch approaches. For simplicity, let us assume $D_k := L_k I$, where $I$ is the
identity matrix. By a careful inspection of (10), we see that $L_k = \sigma_{\max}(\nabla^2 f(x^k))$
achieves the maximum guaranteed decrease (in the worst case sense) in the
objective. There are many principled ways of approximating this constant based
on the secant equation underlying the quasi-Newton methods. In Section 4 we
use the Barzilai-Borwein rule:

$$ L_k := \frac{\|y^k\|_2^2}{\langle y^k, s^k \rangle}, \quad \text{where } s^k := x^k - x^{k-1} \text{ and } y^k := \nabla f(x^k) - \nabla f(x^{k-1}). \tag{11} $$
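
A minimal sketch of this curvature estimate is given below, assuming that (11) is the Barzilai-Borwein-type ratio $L_k = \|y^k\|_2^2/\langle y^k, s^k\rangle$. The function name and the `fallback` value, used when $\langle y^k, s^k\rangle$ is not positive, are assumptions added for the example.

```python
def bb_lipschitz_estimate(x, x_prev, grad, grad_prev, fallback=1.0):
    """Estimate L_k = ||y||^2 / <y, s> with s = x - x_prev and y = grad - grad_prev."""
    s = [a - b for a, b in zip(x, x_prev)]
    yv = [a - b for a, b in zip(grad, grad_prev)]
    ys = sum(a * b for a, b in zip(yv, s))
    if ys <= 0.0:
        # No usable curvature information along s; keep a user-supplied value.
        return fallback
    return sum(a * a for a in yv) / ys
```

For a quadratic $f(x) = (c/2)\|x\|_2^2$ the estimate recovers the curvature $c$ exactly, since $y^k = c\,s^k$.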

We then deviate from the standard backtracking approaches. As opposed to,
for instance, checking the Armijo-Goldstein condition, we use a new analytic
condition (i.e., Step 5 of Algorithm 1), which is computationally cheaper in
many cases. Our analytic step-size then further refines the solution based on
the worst-case problem structure, even if the backtracking update satisfies the
Armijo-Goldstein condition.

Surprisingly, our analysis also enables us to establish local linear
convergence as described in Theorem 1 under mild assumptions. The proof can be
found in the appendix.
Theorem 1. Let $\{x^k\}_{k \ge 0}$ be a sequence generated by Algorithm 1. Suppose that
the sub-level set $\mathcal{L}_F(F(x^0)) := \{x \in \mathrm{dom}(F) : F(x) \le F(x^0)\}$ is bounded and
$\nabla^2 f$ is nonsingular at some $x \in \mathrm{dom}(f)$. Suppose further that $D_k := L_k I \succeq \tau I_n$
for given $\tau > 0$. Then, $\{x^k\}$ converges to $x^*$, the solution of (1). Moreover,
if $\rho_* := \max\{L_k/\sigma^*_{\min} - 1, 1 - L_k/\sigma^*_{\max}\} < 1/2$ for $k$ sufficiently large, then the
sequence $\{x^k\}$ locally converges to $x^*$ at a linear rate, where $\sigma^*_{\min}$ and $\sigma^*_{\max}$ are
the smallest and the largest eigenvalues of $\nabla^2 f(x^*)$, respectively.

Linear convergence: According to Theorem 1, linear convergence is only
possible when the condition number $\kappa$ of the Hessian at the true solution satisfies
$\kappa = \sigma^*_{\max}/\sigma^*_{\min} < 3$. While this seems too imposing, we claim that, for most
$f$ and $g$, this requirement is not too difficult to satisfy (see also the empirical
evidence in Section 4). This is because the proof of Theorem 1 only needs the
smallest and the largest eigenvalues of $\nabla^2 f(x^*)$, restricted to the subspaces of
the union of $x^* - x^k$ for $k$ sufficiently large, to satisfy the conditions imposed
by $\rho_*$. For instance, when $g$ is based on the $\ell_1$-norm/the nuclear norm, the
differences $x^* - x^k$ have at most twice the sparsity/rank of $x^*$ near convergence.
Given such subspace restrictions, one can prove, via probabilistic assumptions on
$f$ (cf. [?]), that the restricted condition number is not only dramatically smaller
than the full condition number $\kappa$ of the Hessian $\nabla^2 f(x^*)$, but can even
be dimension independent with high probability.
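
The bound $\kappa < 3$ follows from a short computation, sketched here under the assumption that the threshold on $\rho_*$ in Theorem 1 is $1/2$ (the value consistent with the stated bound on $\kappa$):

```latex
\rho_* = \max\left\{ \frac{L_k}{\sigma^*_{\min}} - 1,\; 1 - \frac{L_k}{\sigma^*_{\max}} \right\} < \frac{1}{2}
\quad\Longleftrightarrow\quad
\frac{1}{2}\,\sigma^*_{\max} \;<\; L_k \;<\; \frac{3}{2}\,\sigma^*_{\min}.
% Such an L_k exists if and only if the interval is nonempty:
\frac{1}{2}\,\sigma^*_{\max} < \frac{3}{2}\,\sigma^*_{\min}
\quad\Longleftrightarrow\quad
\kappa = \frac{\sigma^*_{\max}}{\sigma^*_{\min}} < 3.
```

In other words, the requirement on $\rho_*$ asks that $L_k$ lie within a factor-of-two window around the spectrum of $\nabla^2 f(x^*)$, which is only possible when that spectrum is itself narrow.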

3.3 Proximal-Newton algorithm 

The case $H_k = \nabla^2 f(x^k)$ deserves special attention as the step-size selection
rule becomes explicit and backtracking-free. The resulting method is a
proximal-Newton method and can be computationally attractive in certain big data
problems due to its low iteration count.

The main step of the proximal-Newton algorithm is to compute the
proximal-Newton search direction $d^k$ as:

$$ d^k := s^k - x^k, \quad \text{where} \quad s^k := \arg\min_{x} \{ Q_{\nabla^2 f(x^k)}(x, x^k) + g(x) \}. \tag{12} $$

Then, it updates the sequence $\{x^k\}$ by:

$$ x^{k+1} := x^k + \alpha_k d^k = (1 - \alpha_k) x^k + \alpha_k s^k, \tag{13} $$

where $\alpha_k \in (0, 1]$ is the step size. If we set $\alpha_k = 1$ for all $k \ge 0$, then (13)
is called the full-step proximal-Newton method. Otherwise, it is a damped-step
proximal-Newton method.

First, we show how to compute the step size $\alpha_k$ in the following lemma, which
is a direct consequence of Lemma 2 by taking $H_k = \nabla^2 f(x^k)$.

Lemma 3. Let $\{x^k\}_{k \ge 0}$ be a sequence generated by the proximal-Newton scheme
(13) starting from $x^0 \in \mathrm{dom}(F)$. Let $\lambda_k$ and $r_k$ be as defined by (8). If we choose
the step-size $\alpha_k = r_k^{-1} \ln(1 + r_k)$ then:

$$ F(x^{k+1}) \le F(x^k) - r_k^{-1} \lambda_k^2 \left[ (1 + r_k^{-1}) \ln(1 + r_k) - 1 \right]. \tag{14} $$

Moreover, this step-size $\alpha_k$ is optimal (w.r.t. the worst-case performance).

Next, Theorem 2 proves the local quadratic convergence of the full-step
proximal-Newton method, whose proof can be found in the appendix.
Theorem 2. Suppose that the sequence $\{x^k\}_{k \ge 0}$ is generated by (13) with full
step, i.e., $\alpha_k = 1$ for $k \ge 0$. If $r_k \le \ln(4/3) \approx 0.28768207$ then it holds that:

(15)

where $\sigma^k_{\min}$ is the smallest eigenvalue of $\nabla^2 f(x^k)$. Consequently, if we choose
$x^0$ such that $\lambda_0 \le \sigma_{\min}(\nabla^2 f(x^0)) \ln(4/3)$, then the sequence $\{\lambda_k/\sqrt{\sigma^k_{\min}}\}_{k \ge 0}$
converges to zero at a quadratic rate.

Theorem 2 rigorously establishes where we can take full steps and still have
quadratic convergence. Based on this information, we propose the
proximal-Newton algorithm as in Algorithm 2.


Algorithm 2 (Prototype proximal-Newton algorithm)

Initialization: Given $x^0 \in \mathrm{dom}(F)$ and $\sigma \in (0, \sigma_{\min}(\nabla^2 f(x^0)) \ln(4/3)]$.

for $k = 0$ to $k_{\max}$ do

1. Compute $s^k$ by (12). Then, define $d^k := s^k - x^k$ and $\lambda_k := \|d^k\|_{x^k}$.

2. If $\lambda_k \le \varepsilon$, then terminate.

3. If $\lambda_k > \sigma$, then compute $r_k := M_f \|d^k\|_2$ and $\alpha_k := \frac{1}{r_k} \ln(1 + r_k)$; else $\alpha_k := 1$.

4. Update $x^{k+1} := x^k + \alpha_k d^k$.

end for
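
To illustrate the switch between damped and full steps in Algorithm 2, the sketch below applies it to a scalar toy problem, $\min_x e^{ax+b} + cx + \rho|x|$, for which $M_f = |a|$ and the subproblem (12) has a closed-form soft-thresholding solution. The toy problem, the function names, and the default tolerance are assumptions chosen so that the example stays self-contained; they are not taken from the experiments.

```python
import math


def soft_threshold(v, thr):
    return math.copysign(max(abs(v) - thr, 0.0), v)


def prox_newton_scalar(a, b, c, rho, x0, eps=1e-10, max_iter=100):
    """Sketch of Algorithm 2 for min_x exp(a*x + b) + c*x + rho*|x| with scalar x."""
    M_f = abs(a)
    hess0 = a * a * math.exp(a * x0 + b)
    sigma = hess0 * math.log(4.0 / 3.0)  # largest threshold allowed by the initialization
    x = x0
    for _ in range(max_iter):
        e = math.exp(a * x + b)
        grad = a * e + c
        hess = a * a * e
        # Step 1: scalar proximal-Newton subproblem solved by soft-thresholding.
        s = soft_threshold(x - grad / hess, rho / hess)
        d = s - x
        lam = math.sqrt(hess) * abs(d)
        if lam <= eps:  # Step 2: stopping test
            break
        if lam > sigma:  # Step 3: damped step far from the solution
            r = M_f * abs(d)
            alpha = math.log1p(r) / r
        else:            # full step inside the fast local region
            alpha = 1.0
        x += alpha * d   # Step 4
    return x
```

With $a = 1$, $b = 0$, $c = -2$ and $\rho = 0.5$ the minimizer is $x^* = \ln(1.5)$, which the sketch reaches from either side without any line search.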


The most remarkable feature of Algorithm 2 is that it does not require any
globalization strategy such as backtracking line search for global convergence.

Complexity analysis. First, we estimate the number of iterations needed when
$\lambda_k \le \sigma$ to reach the solution $x^k$ such that $\lambda_k \le \varepsilon$ for a given tolerance $\varepsilon > 0$.
Based on the conclusion of Theorem 2, we can show that the number of iterations
of Algorithm 2 when $\lambda_k \le \sigma$ does not exceed $k_{\max} := \log_2(\,\cdots)$. Finally,
we estimate the number of iterations needed when $\lambda_k > \sigma$. From Lemma 3, we
see that for all $k \ge 0$ we have $\lambda_k \ge \sigma$ and $r_k \ge \sigma$. Therefore, the number of
iterations is

$$ \left\lfloor \frac{F(x^0) - F(x^*)}{\psi(\sigma)} \right\rfloor, \quad \text{where } \psi(\tau) := \tau\left((1 + \tau^{-1}) \ln(1 + \tau) - 1\right) > 0. $$


3.4 Proximal quasi-Newton algorithm 

In many applications, estimating the Hessian $\nabla^2 f(x^k)$ can be costly even though
the Hessian is given in a closed form (cf. Section 4). In such cases, variable metric
strategies employing an approximate Hessian can provide computation-accuracy
tradeoffs. Among these approximations, applying quasi-Newton methods with
BFGS updates for $H_k$ would ensure its positive definiteness. Our analytic
step-size procedures with backtracking automatically apply to the BFGS
proximal-quasi-Newton method, whose algorithmic details and convergence analysis are
omitted here.


4 Numerical experiments 

We use a variety of real-data problems to illustrate the performance
of our variable metric framework using a MATLAB implementation. We pick
two advanced solvers for comparison: TFOCS [4] and PNOPT [8]. TFOCS hosts
accelerated first order methods. PNOPT provides several proximal-(quasi)
Newton implementations, which have been shown to be quite successful in
logistic regression problems [5]. Both use sophisticated backtracking linesearch
enhancements. We benchmark all algorithms with performance profiles [7].

A performance profile is built based on a set $\mathcal{S}$ of $n_s$ algorithms (solvers) and a
collection $\mathcal{P}$ of $n_p$ problems. We first build a profile based on computational time.
We denote by $T_{p,s}$ the computational time required to solve problem $p$ by solver $s$.
We compare the performance of algorithm $s$ on problem $p$ with the best
performance of any algorithm on this problem; that is, we compute the performance
ratio $r_{p,s} := T_{p,s} / \min\{T_{p,\hat{s}} : \hat{s} \in \mathcal{S}\}$. Now, let $\tilde{\rho}_s(\tilde{\tau}) := \frac{1}{n_p} \mathrm{size}\{p \in \mathcal{P} : r_{p,s} \le \tilde{\tau}\}$ for $\tilde{\tau} \in \mathbb{R}_+$.
The function $\tilde{\rho}_s : \mathbb{R} \to [0, 1]$ is the probability for solver $s$ that a performance
ratio is within a factor $\tilde{\tau}$ of the best possible ratio. We use the term “performance
profile” for the distribution function $\tilde{\rho}_s$ of a performance metric. In the
following numerical examples, we plot the performance profiles in $\log_2$-scale,
i.e., $\rho_s(\tau) := \frac{1}{n_p} \mathrm{size}\{p \in \mathcal{P} : \log_2(r_{p,s}) \le \tau := \log_2 \tilde{\tau}\}$.
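
The construction above can be sketched in a few lines. The function below evaluates the profile of every solver at a single factor; the dictionary layout of `times` (solver name mapped to a list of per-problem times) and the function name are assumptions made for this example.

```python
def performance_profile(times, factor):
    """Fraction of problems each solver solves within `factor` times the best time.

    times: dict mapping a solver name to a list of positive times, one per problem,
    with all lists in the same problem order.
    """
    solvers = list(times)
    n_p = len(times[solvers[0]])
    # Best time achieved by any solver on each problem.
    best = [min(times[s][p] for s in solvers) for p in range(n_p)]
    return {
        s: sum(1 for p in range(n_p) if times[s][p] / best[p] <= factor) / n_p
        for s in solvers
    }
```

Evaluating at `factor = 1.0` gives the fraction of problems on which each solver is the fastest; to plot in $\log_2$-scale, one would call the function with `factor = 2 ** tau` over a grid of `tau` values.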


4.1 Sparse logistic regression 

We consider the classical logistic regression problem of the form (1):

$$ \min_{x, \mu} \left\{ N^{-1} \sum_{j=1}^{N} \log\left(1 + e^{-y_j(\langle w^{(j)}, x \rangle + \mu)}\right) + \rho N^{-1/2} \|x\|_1 \right\}, \tag{16} $$

where $x \in \mathbb{R}^p$ is an unknown vector, $\mu$ is an unknown bias, and $y_j$ and $w^{(j)}$ are
observations where $j = 1, \dots, N$. The logistic term in (16) is self-concordant-like
with $M_f := \max_j \|w^{(j)}\|_2$ [1]. In this case, the smooth term in (16) has Lipschitz
gradient, hence several fast algorithms are applicable.

Figure 1 illustrates the performance profiles for computational time (left) and
the number of prox-operations (right) using the 36 medium size problems¹. For
comparison, we use TFOCS-N07, which is Nesterov’s 2007 two prox-method;
TFOCS-AT, which is Auslender and Teboulle’s accelerated method; PNOPT
with L-BFGS updates; and our algorithms: proximal gradient and
proximal-Newton. From these performance profiles, we can observe that our proximal
gradient is the best one in terms of computational time and the number of
prox-operations. In terms of time, proximal-gradient solves up to 83.3% of problems
with the best performance, while the corresponding figure for both TFOCS-N07 and
PNOPT-LBFGS is 2.7%. The proximal-Newton algorithm solves 11.1% of problems with the
best performance. In prox-operations, proximal-gradient is also the best one in
75% of problems.

We now show an example of the convergence behavior of our proximal-gradient
algorithm via two large-scale problems with $\rho = 0.1$. The first problem is
rcv1_train.binary with the size $p = 20242$ and $N = 47236$ and the second
one is real-sim with the size $p = 72309$ and $N = 20958$. For comparison, we
use TFOCS-N07 and TFOCS-AT. For this example, PNOPT (with Newton,
BFGS, and L-BFGS options) and our proximal-Newton do not scale and are
omitted.

Figure 2 shows that our simple gradient algorithm locally exhibits linear
convergence whereas the fast method TFOCS-AT shows a sublinear convergence
rate. The variant TFOCS-N07 is Nesterov’s dual proximal algorithm, which
exhibits oscillations but performs comparably to our proximal gradient method
in terms of accuracy, time, and the total number of prox operations. The
computational time and the number of prox-operations for the two problems are
as follows: Proximal-gradient: (15.67s, 698), (13.71s, 152); TFOCS-AT: (20.57s,
678), (33.82s, 466); TFOCS-N07: (17.09s, 1049), (22.08s, 568), respectively. For
these data sets, the relative performance of the algorithms is surprisingly
consistent across various regularization parameters.

Fig. 1. Computational time (left) and number of prox-operations (right)

Fig. 2. Left: rcv1_train.binary, and Right: real-sim.

1 Available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

4.2 Restricted condition number in practice 

The convergence plots in Figure 2 indicate that the linear convergence
condition in Theorem 1 may be satisfied. In fact, in all of our tests, the proximal
gradient algorithm exhibits locally linear convergence. Hence, to see if the
discussion on linear convergence following Theorem 1 is grounded in practice, we
perform the following test on the a#a datasets¹,
consisting of small to medium problems. We first solve each problem with the
proximal-Newton method up to 16 digits of accuracy to obtain $x^*$, and we
calculate $\nabla^2 f(x^*)$. We then run our proximal gradient algorithm until convergence,
and during its linear convergence, we record $\|\nabla^2 f(x^*)(x^* - x^k)\|_2 / \|x^* - x^k\|_2$,
and take the ratio of the maximum and the minimum to estimate the restricted
condition number for each problem.

Figure 3 illustrates that while the condition number of the Hessian $\nabla^2 f(x^*)$
can be extremely large, as the algorithm navigates to the optimal solution $x^*$
through sparse subspaces, the restricted condition number estimates are in fact
very close to 3. Given that the algorithm still exhibits linear convergence for the
cases # = 2, 3, 4, 5, 6, 8 (where our condition cannot be met), we believe that
the tightness of our convergence condition is an artifact of our proof and may
be improved.

Fig. 3. Restricted condition number (left), and condition number (right) estimates


4.3 Sparse multinomial logistic regression 

For sparse multinomial logistic regression, the underlying problem is
formulated in the form of (1), in which the objective function $f$ is given as:

$$ f(X) := N^{-1} \sum_{j=1}^{N} \left[ \log\left(1 + \sum_{i=1}^{m} e^{\langle w^{(j)}, X^{(i)} \rangle}\right) - \sum_{i=1}^{m} y_i^{(j)} \langle w^{(j)}, X^{(i)} \rangle \right], \tag{17} $$

where $X$ can be considered as a matrix variable of size $m \times p$ formed from
$X^{(1)}, \dots, X^{(m)}$. Other vectors, $y^{(j)}$ and $w^{(j)}$, are given as input data for $j = 1, \dots, N$.
The function $f$ has a closed form gradient as well as Hessian. However,
forming a full Hessian matrix $\nabla^2 f(x)$ is especially costly in large scale problems
when $N \gg 1$. In this case, proximal-quasi-Newton methods are more suitable.
First, we show in Lemma 4 that $f$ satisfies Definition 1, whose proof is in the
appendix.


Lemma 4. The function $f$ defined by (17) is convex and self-concordant-like in
the sense of Definition 1 with the parameter $M_f := \sqrt{6}\, N^{-1} \max_{j} \|w^{(j)}\|_2$.

The performance profiles of 20 small-to-medium size problems¹ are shown in
Figure 4 in terms of computational time (left) as well as number of
prox-operations (right), respectively. Both the proximal-gradient method and the
proximal-Newton method with BFGS have good performance. They can solve up to 55%
and 45% of problems with the best time performance, respectively. These methods
are also the best in terms of prox-operations (70% and 30%).


4.4 A stylized example of a non-Lipschitz gradient function for (1)

We consider the following convex composite minimization problem by modifying
one of the canonical examples of geometric programming [6]:

$$ \min_{x \in \Omega} \left\{ f(x) := \sum_{i=1}^{m} e^{a_i^T x + b_i} + c^T x \right\} + g(x), \tag{18} $$

where $\Omega$ is a simple convex set, $a_i, c \in \mathbb{R}^n$ and $b_i \in \mathbb{R}$ are random, and $g$
is the $\ell_1$-norm. After some algebra, we can show that $f$ satisfies Definition 1
with $M_f := \max\{\|a_i\|_2 : 1 \le i \le m\}$. Unfortunately, $f$ does not have Lipschitz
continuous gradient in $\mathbb{R}^n$.

Fig. 4. Computational time (left), and number of prox-operations (right)

We implement our proximal-gradient algorithm and compare it with TFOCS
and PNOPT-LBFGS. However, TFOCS breaks down in running this example
due to the estimation of the Lipschitz constant, while PNOPT is rather slow. Several
tests on synthetic data show that our algorithm outperforms PNOPT-LBFGS.
As an example, we show the convergence behavior of both these methods in
Figure 5, where we plot the accuracy of the objective values w.r.t. the number of
prox-operators for two cases of $\varepsilon = 10^{-6}$ and $\varepsilon = 10^{-12}$, respectively. As we can
see from this figure, our prox-gradient method requires many fewer
prox-operations to achieve a very high accuracy compared to PNOPT. Moreover, our
method is also 20 to 40 times faster than PNOPT in this numerical test.

Fig. 5. Relative objective values w.r.t. #prox: left: $\varepsilon = 10^{-6}$, and right: $\varepsilon = 10^{-12}$


5 Conclusions 

Convex optimization efficiency relies significantly on the structure of the
objective functions. In this paper, we propose a variable metric method for minimizing
the sum of a self-concordant-like convex function and a proximally tractable
convex function. Our framework is applicable in several interesting machine
learning problems and does not rely on the usual Lipschitz gradient assumption on
the smooth part for its convergence theory. A highlight of this work is the new
analytic step-size selection procedure that enhances backtracking procedures.
Thanks to this new approach, we can prove that the basic gradient variant of
our framework has improved local convergence guarantees under certain
conditions while the tuning-free proximal Newton method has locally quadratic
convergence. While our assumption on the restricted condition number in Theorem
1 is not deterministically verifiable a priori, we provide empirical evidence that
it can hold in many practical problems. Numerical experiments on different
applications that have both self-concordant-like and Lipschitz gradient properties
demonstrate that the gradient algorithm based on the former assumption can
be more efficient than the fast algorithms based on the latter assumption. As a
result, we plan to look into fast versions of our gradient scheme as future work.


References 

1. F. Bach. Self-concordant analysis for logistic regression. Electron. J. Statist.,
4:384-414, 2010.

2. F. Bach. Adaptivity of averaged stochastic gradient descent to local strong
convexity for logistic regression. 2013.

3. A. Beck and M. Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for
Linear Inverse Problems. SIAM J. Imaging Sciences, 2(1):183-202, 2009.

4. S. Becker, E. J. Candès, and M. Grant. Templates for convex cone problems with
applications to sparse signal recovery. Mathematical Programming Computation,
3(3):165-218, 2011.

5. S. Becker and M.J. Fadili. A quasi-Newton proximal splitting method. In Adv.
Neural Information Processing Systems, 2012.

6. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,
2004.

7. E.D. Dolan and J.J. Moré. Benchmarking optimization software with performance
profiles. Math. Program., 91:201-213, 2002.

8. J.D. Lee, Y. Sun, and M.A. Saunders. Proximal Newton-type methods for convex
optimization. SIAM J. Optim., 24(3):1420-1443, 2014.

9. H. Mine and M. Fukushima. A minimization method for the sum of a convex
function and a continuously differentiable function. J. Optim. Theory Appl., 33:9-23,
1981.

10. Y. Nesterov. Introductory lectures on convex optimization: a basic course,
volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.

11. Y. Nesterov. Gradient methods for minimizing composite objective function. Math.
Program., 140(1):125-161, 2013.

12. Y. Nesterov and A. Nemirovski. Interior-point Polynomial Algorithms in Convex
Programming. Society for Industrial Mathematics, 1994.

13. J. Nocedal and S.J. Wright. Numerical Optimization. Springer Series in Operations
Research and Financial Engineering. Springer, 2nd edition, 2006.

14. N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in
Optimization, 1(3):123-231, 2013.

15. S. M. Robinson. Strongly Regular Generalized Equations. Mathematics of
Operations Research, 5(1):43-62, 1980.

16. R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series.
Princeton University Press, 1970.

17. M. Schmidt, N.L. Roux, and F. Bach. Convergence rates of inexact
proximal-gradient methods for convex optimization. NIPS, Granada, Spain, 2011.

18. Q. Tran-Dinh, A. Kyrillidis, and V. Cevher. Composite self-concordant
minimization. J. Mach. Learn. Res. (accepted), 15:1-54, 2014.

19. Y. X. Yuan. Recent advances in numerical methods for nonlinear equations and
nonlinear least squares. Numerical Algebra, Control and Optimization, 1(1):15-34,
2011.




A Appendix: Composite convex minimization involving 
self-concordant-like cost functions 

We derive some fundamental properties of self-concordant-like functions, introduce the notion of scaled proximal operators, and provide the full proofs of the technical results in the main text.

A.1 Properties of self-concordant-like functions

We define $D^2 f(x)[u,u] := \|u\|_x^2$ and $D^3 f(x)[u,u,u] := \langle D^3 f(x)[u]u, u\rangle$, following the notations in [?]. An equivalent definition of self-concordant-like functions is provided by the following theorem [1].

Theorem 3. A convex function $f \in \mathcal{C}^3 : \mathbb{R}^n \to \mathbb{R}$ is $M_f$-self-concordant-like if and only if for any $x, u_1, u_2, u_3 \in \mathbb{R}^n$, we have:

$$|D^3 f(x)[u_1, u_2, u_3]| \le M_f \|u_1\|_2 \|u_2\|_x \|u_3\|_x.$$

Let $f : \mathbb{R}^n \to \mathbb{R}$ be an $M_f$-self-concordant-like function. Then:

a) The function $f_q(x) := \alpha + \langle a, x\rangle + (1/2)\langle Ax, x\rangle + f(x)$ is $M_f$-self-concordant-like, for any $\alpha \in \mathbb{R}$, $a \in \mathbb{R}^n$, and symmetric positive semidefinite matrix $A : \mathbb{R}^n \to \mathbb{R}^n$.

b) The function $\alpha f$ is $M_f$-self-concordant-like for any $\alpha > 0$.

c) Let $g$ be an $M_g$-self-concordant-like function. Then the function $(f + g)$ is $M$-self-concordant-like, where $M := \max\{M_f, M_g\}$.

d) The function $f(Ax + b)$ is $(M_f \|A\|_2)$-self-concordant-like, for any matrix $A : \mathbb{R}^m \to \mathbb{R}^n$, $x \in \mathbb{R}^m$, and $b \in \mathbb{R}^n$.

We will repeatedly use the following inequalities in the rest of this appendix. 

Theorem 4. Let $f : \mathbb{R}^n \to \mathbb{R}$ be an $M_f$-self-concordant-like function, and define $\lambda_x(y) := \|y - x\|_x$ and $r_x(y) := M_f \|y - x\|_2$ for all $x, y \in \mathrm{dom}(f)$. We have the following inequalities for any $x, y \in \mathrm{dom}(f)$:

a) Bounds on the local norm:

$$e^{-r_x(y)/2} \lambda_x(y) \le \lambda_y(x) \le e^{r_x(y)/2} \lambda_x(y). \tag{19}$$

b) Bounds on the Hessian matrix:

$$e^{-r_x(y)} \nabla^2 f(x) \preceq \nabla^2 f(y) \preceq e^{r_x(y)} \nabla^2 f(x). \tag{20}$$

c) Bounds on the gradient vector:

$$\gamma_*(r_x(y)) \lambda_x(y)^2 \le \langle \nabla f(y) - \nabla f(x), y - x\rangle \le \gamma(r_x(y)) \lambda_x(y)^2, \tag{21}$$

where $\gamma_*(\tau) := \frac{1 - e^{-\tau}}{\tau}$ and $\gamma(\tau) := \frac{e^{\tau} - 1}{\tau}$.

d) Bounds on the function value:

$$\omega_*(r_x(y)) \lambda_x(y)^2 \le f(y) - f(x) - \nabla f(x)^T (y - x) \le \omega(r_x(y)) \lambda_x(y)^2, \tag{22}$$

where $\omega_*(\tau) := \frac{e^{-\tau} + \tau - 1}{\tau^2}$ and $\omega(\tau) := \frac{e^{\tau} - \tau - 1}{\tau^2}$ are both strictly convex on $(0, +\infty)$; $\omega$ is increasing, while $\omega_*$ is decreasing.

Proof. We denote $y_t := x + t(y - x)$ for notational convenience, where $t \in [0,1]$. First we prove (19). Consider the function $\phi(t) := \log\langle \nabla^2 f(y_t)u, u\rangle$ with $u := y - x$. We have, by Definition 1, that:

$$|\phi'(t)| = \frac{|D^3 f(y_t)[u,u,u]|}{D^2 f(y_t)[u,u]} \le M_f \|u\|_2.$$

Since $u = y - x$, we have $\phi(0) = \log\|y - x\|_x^2$ and $\phi(1) = \log\|y - x\|_y^2$. Integrating $\phi'(\cdot)$ over the interval $[0,1]$, we obtain:

$$\left|\log\|y - x\|_y^2 - \log\|y - x\|_x^2\right| \le M_f \|y - x\|_2,$$

which leads to (19).

Next, we prove (20). Consider the function $\varphi(t) := \langle \nabla^2 f(y_t)u, u\rangle = \|u\|_{y_t}^2$ for some $u \in \mathbb{R}^n$. We have $\varphi(0) = \|u\|_x^2$ and $\varphi(1) = \|u\|_y^2$. By Theorem 3 we obtain:

$$|\varphi'(t)| = |D^3 f(y_t)[y - x, u, u]| \le M_f \|y - x\|_2\, \varphi(t),$$

or, equivalently,

$$\left|\frac{d \ln \varphi(t)}{dt}\right| \le M_f \|y - x\|_2.$$

We get (20) by integrating both sides over $[0,1]$.

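Now, we prove (21). By the mean-value theorem, we have:

$$\langle \nabla f(y) - \nabla f(x), y - x\rangle = \int_0^1 \langle \nabla^2 f(y_t)(y - x), y - x\rangle\, dt.$$

Applying the right-hand side of (20), we obtain:

$$\int_0^1 \langle \nabla^2 f(y_t)(y - x), y - x\rangle\, dt \le \int_0^1 \exp\left(M_f \|y_t - x\|_2\right) \langle \nabla^2 f(x)(y - x), y - x\rangle\, dt.$$

Since $M_f \|y_t - x\|_2 = t\, r_x(y)$ and $\int_0^1 e^{t r}\, dt = (e^{r} - 1)/r$, this leads to the right-hand side of (21). Similarly, we can prove the left-hand side of (21).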

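Finally, (22) is a direct consequence of (21), since:

$$f(y) - f(x) - \langle \nabla f(x), y - x\rangle = \int_0^1 \frac{1}{t} \langle \nabla f(y_t) - \nabla f(x), y_t - x\rangle\, dt,$$

and (21) can be applied to the pair $(x, y_t)$, for which $r_x(y_t) = t\, r_x(y)$ and $\lambda_x(y_t) = t\, \lambda_x(y)$. Hence, all the statements of Theorem 4 are proved. □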
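The bounds (19), (21), and (22) can be sanity-checked numerically. The sketch below does this for a regularized logistic function $f(x) = \log(1 + e^{\langle w, x\rangle}) + (\mu/2)\|x\|_2^2$. The test function, the helper names, and the choice $M_f = \sqrt{6}\|w\|_2$ (the constant produced by the argument of Section A.6 for a two-term log-sum-exp) are illustrative assumptions, not something prescribed by the text.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def f(x, w, mu):
    z = dot(w, x)
    # log(1 + e^z), written to avoid overflow for large |z|.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z))) + 0.5 * mu * dot(x, x)


def grad(x, w, mu):
    s = sigmoid(dot(w, x))
    return [s * wi + mu * xi for wi, xi in zip(w, x)]


def local_norm_sq(x, u, w, mu):
    # <Hessian(x) u, u>, where Hessian(x) = s(1 - s) w w^T + mu I.
    s = sigmoid(dot(w, x))
    return s * (1.0 - s) * dot(w, u) ** 2 + mu * dot(u, u)


def theorem4_bounds_hold(x, y, w, mu, tol=1e-12):
    """Check (19) in squared form, (21), and (22) at a pair x != y."""
    M_f = math.sqrt(6.0) * math.sqrt(dot(w, w))   # assumed constant
    d = [yi - xi for xi, yi in zip(x, y)]
    r = M_f * math.sqrt(dot(d, d))                # r_x(y)
    lam_x2 = local_norm_sq(x, d, w, mu)           # lambda_x(y)^2
    lam_y2 = local_norm_sq(y, d, w, mu)           # lambda_y(x)^2
    gx, gy = grad(x, w, mu), grad(y, w, mu)
    inner = dot([a - b for a, b in zip(gy, gx)], d)
    gap = f(y, w, mu) - f(x, w, mu) - dot(gx, d)

    gamma_lo = -math.expm1(-r) / r                # (1 - e^{-r}) / r
    gamma_hi = math.expm1(r) / r                  # (e^{r} - 1) / r
    omega_lo = (math.expm1(-r) + r) / r ** 2      # (e^{-r} + r - 1) / r^2
    omega_hi = (math.expm1(r) - r) / r ** 2       # (e^{r} - r - 1) / r^2

    ok19 = math.exp(-r) * lam_x2 - tol <= lam_y2 <= math.exp(r) * lam_x2 + tol
    ok21 = gamma_lo * lam_x2 - tol <= inner <= gamma_hi * lam_x2 + tol
    ok22 = omega_lo * lam_x2 - tol <= gap <= omega_hi * lam_x2 + tol
    return ok19 and ok21 and ok22
```

Calling `theorem4_bounds_hold(x, y, w, mu)` on random pairs is a quick way to catch a sign or exponent slip when transcribing the bounds; it is not a substitute for the proof.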


A.2 Proof of Lemma 1: The existence and uniqueness of x*

Consider the level set $\mathcal{L}_F(x) := \{y \in \mathrm{dom}(F) : F(y) \le F(x)\}$. By (22) and the convexity of $g$, for any $y \in \mathcal{L}_F(x)$ and $v \in \partial g(x)$, we have:

$$F(x) \ge F(y) \ge F(x) + \langle \nabla f(x) + v, y - x\rangle + \omega_*(r_x(y)) \lambda_x(y)^2,$$

where we use the notations $r_x(y)$ and $\lambda_x(y)$ in Theorem 4. Applying the Cauchy-Schwarz inequality to the inner product, we obtain:

$$\omega_*(r_x(y)) \lambda_x(y) \le \|\nabla f(x) + v\|_x^*.$$

Note that $\lambda_x(y) \ge \sqrt{\sigma_{\min}}\, \|x - y\|_2$, where $\sigma_{\min}$ is the smallest eigenvalue of $\nabla^2 f(x)$. This inequality implies that:

$$\omega_*(r_x(y))\, r_x(y) \le (M_f / \sqrt{\sigma_{\min}})\, \|\nabla f(x) + v\|_x^*.$$

The function $\psi(t) := t\omega_*(t)$ is increasing in $[0,+\infty)$ and its values are also in $[0,+\infty)$, so its inverse $\psi^{-1}$ is also increasing. This implies that the level set $\mathcal{L}_F(x)$ is bounded, and thus problem (1) has a solution $x^*$.

Let $y \ne x^*$ be a point in $\mathrm{dom}(f)$. By (22) and the optimality condition (3), which provides $v^* := -\nabla f(x^*) \in \partial g(x^*)$, we have:

$$f(y) \ge f(x^*) + \langle \nabla f(x^*), y - x^*\rangle + \omega_*(r_{x^*}(y)) \lambda_{x^*}(y)^2,$$
$$g(y) \ge g(x^*) + \langle v^*, y - x^*\rangle = g(x^*) - \langle \nabla f(x^*), y - x^*\rangle.$$

Summing up the two inequalities, we obtain

$$F(y) \ge F(x^*) + \omega_*(r_{x^*}(y)) \lambda_{x^*}(y)^2. \tag{24}$$

By (20) and the non-singularity of $\nabla^2 f(x)$ for some $x \in \mathrm{dom}(f)$, the function $f$ is strictly convex, and the uniqueness of $x^*$ follows. □


A.3 Proof of Lemma 2: Step-size selection strategy

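Since $s^k$ is the solution of the convex subproblem that defines it, we have $0 \in \nabla f(x^k) + D_k d^k + \partial g(s^k)$, or, equivalently:

$$-(\nabla f(x^k) + D_k d^k) \in \partial g(s^k). \tag{25}$$

Using (22) with $x = x^k$ and $y = x^{k+1}$, we can derive:

$$f(x^{k+1}) \le f(x^k) + \alpha_k \langle \nabla f(x^k), d^k\rangle + \omega(\alpha_k r_k)\, \alpha_k^2 \lambda_k^2. \tag{26}$$

Since $x^{k+1} = (1 - \alpha_k) x^k + \alpha_k s^k$, it follows by the convexity of $g$ and (25) that:

$$g(x^{k+1}) \le g(x^k) - \alpha_k \langle \nabla f(x^k) + D_k d^k, d^k\rangle. \tag{27}$$

Summing up (26) and (27), we obtain the following estimate:

$$F(x^{k+1}) \le F(x^k) - \psi_k(\alpha_k),$$

where $\psi_k(\tau) := \beta_k^2 \tau - \lambda_k^2\, \omega(r_k \tau)\, \tau^2$. It is easy to check that the function $\psi_k$ is concave, and attains the maximum at $\tau_k^* := \frac{1}{r_k} \ln\left(1 + \frac{\beta_k^2 r_k}{\lambda_k^2}\right)$ with the maximum value:

$$\psi_k(\tau_k^*) = \frac{\beta_k^2}{r_k} \left[\left(1 + \frac{\lambda_k^2}{r_k \beta_k^2}\right) \ln\left(1 + \frac{\beta_k^2 r_k}{\lambda_k^2}\right) - 1\right].$$

Moreover, $\tau_k^* \le 1$ due to the condition $\beta_k^2 r_k \le (e^{r_k} - 1) \lambda_k^2$. By choosing $\alpha_k = \tau_k^*$, we obtain the estimate claimed in Lemma 2. Since $\alpha_k$ maximizes $\psi_k$, it is optimal in the sense of the worst-case performance. □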
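The step-size rule of this proof is easy to evaluate in code. The following minimal sketch treats $\beta_k^2$, $\lambda_k^2$, and $r_k$ as given positive scalars (how they are computed from the iterates is left to the main text) and uses hypothetical function names. It evaluates $\psi_k$, its maximizer $\frac{1}{r_k}\ln(1 + \beta_k^2 r_k/\lambda_k^2)$, and the closed form of the maximum, which follows from setting $\psi_k'(\tau) = \beta_k^2 - \lambda_k^2 (e^{r_k\tau} - 1)/r_k$ to zero.

```python
import math


def psi(tau, beta_sq, lam_sq, r):
    # psi_k(tau) = beta^2 tau - lambda^2 omega(r tau) tau^2 with
    # omega(t) = (e^t - t - 1) / t^2, so the tau^2 factors cancel.
    return beta_sq * tau - lam_sq * (math.expm1(r * tau) - r * tau) / r ** 2


def analytic_step(beta_sq, lam_sq, r):
    # Maximizer of psi_k: (1 / r) * ln(1 + beta^2 r / lambda^2).
    return math.log1p(beta_sq * r / lam_sq) / r


def guaranteed_decrease(beta_sq, lam_sq, r):
    # Closed form of psi_k evaluated at its maximizer.
    c = beta_sq * r / lam_sq
    return (beta_sq / r) * ((1.0 + 1.0 / c) * math.log1p(c) - 1.0)
```

For instance, with the made-up values `beta_sq=2.0`, `lam_sq=1.5`, `r=0.8`, the condition $\beta_k^2 r_k \le (e^{r_k} - 1)\lambda_k^2$ holds and `analytic_step` returns a value in $(0, 1]$.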
A.4 Proof of Theorem 1: Local linear convergence

Scaled proximal operators. In order to prove Theorem 1 and Theorem 2, we introduce the notion of scaled proximal operators here.

Definition 2. Let $H \in \mathcal{S}^n_{++}$ be a positive definite matrix, and $g$ be a proper, lower semi-continuous convex function. We define the operator $\mathcal{P}_H^g$ as:

$$\mathcal{P}_H^g(u) = (H + \partial g)^{-1}(u) := \arg\min_x \{g(x) + (1/2)\langle Hx, x\rangle - \langle u, x\rangle\}.$$

We refer to $\mathcal{P}_H^g$ as a scaled proximity operator. Note that if $H$ is the identity matrix, $\mathcal{P}_H^g$ collapses to the standard proximal operator [?]. For $H \in \mathcal{S}^n_{++}$, we define the weighted norm of $x$ as $\|x\|_H := \langle Hx, x\rangle^{1/2}$ and its dual norm of $y$ as $\|y\|_H^* := \langle H^{-1}y, y\rangle^{1/2}$.

Lemma 5. Let $g$ be a proper, lower semi-continuous convex function, and let $H$ be a positive definite matrix. The mapping $\mathcal{P}_H^g$ is non-expansive in terms of the norm defined by $H$, i.e.:

$$\|\mathcal{P}_H^g(u) - \mathcal{P}_H^g(v)\|_H \le \|u - v\|_H^*, \quad \forall u, v. \tag{28}$$

Proof. Let $p := \mathcal{P}_H^g(u)$ and $q := \mathcal{P}_H^g(v)$. We have $u - Hp \in \partial g(p)$ and $v - Hq \in \partial g(q)$. By the convexity of $g$, we have $\langle u - v - Hp + Hq, p - q\rangle \ge 0$. This implies that $\langle p - q, u - v\rangle \ge \|p - q\|_H^2$. By the Cauchy-Schwarz inequality, we obtain $\|u - v\|_H^* \ge \|p - q\|_H$, which proves the lemma. □
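To make Definition 2 and Lemma 5 concrete, here is a minimal sketch for one special case chosen purely for illustration: $g = \rho\|\cdot\|_1$ and a diagonal $H = \mathrm{diag}(h)$ with $h > 0$. In that case the minimization separates by coordinate and the scaled proximal operator reduces to soft-thresholding of $u_i$ followed by division by $h_i$. The function names are hypothetical.

```python
import math


def scaled_prox_l1(u, h, rho):
    """Scaled proximal operator of g = rho * ||.||_1 for H = diag(h), h > 0.

    Solves argmin_x rho*|x_i| + 0.5*h_i*x_i^2 - u_i*x_i coordinate-wise.
    """
    out = []
    for ui, hi in zip(u, h):
        shrunk = max(abs(ui) - rho, 0.0)           # soft-thresholding of u_i
        out.append(math.copysign(shrunk, ui) / hi)
    return out


def norm_H(x, h):
    # ||x||_H = <Hx, x>^{1/2} for diagonal H.
    return math.sqrt(sum(hi * xi * xi for xi, hi in zip(x, h)))


def dual_norm_H(y, h):
    # ||y||_H^* = <H^{-1} y, y>^{1/2} for diagonal H.
    return math.sqrt(sum(yi * yi / hi for yi, hi in zip(y, h)))
```

With $h_i = 1$ the routine is the ordinary soft-thresholding operator, matching the remark that the scaled operator collapses to the standard one when $H$ is the identity. Comparing `norm_H` of the difference of two outputs with `dual_norm_H` of the difference of the inputs reproduces inequality (28) in this special case.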

Let us consider the distance between $x^{k+1}$ and $x^*$ measured by $\|x^{k+1} - x^*\|_{x^*}$. By the definition of $x^{k+1}$, we have:

$$\|x^{k+1} - x^*\|_{x^*} \le (1 - \alpha_k) \|x^k - x^*\|_{x^*} + \alpha_k \|s^k - x^*\|_{x^*}. \tag{29}$$

We then derive an upper bound of $\|s^k - x^*\|_{x^*}$ in terms of $\|x^k - x^*\|_{x^*}$. We define $\mathcal{P}_{x^*}(u) := \mathcal{P}^g_{H_*}(u)$ with $H_* := \nabla^2 f(x^*)$, $S_{x^*}(u) := \nabla^2 f(x^*)u - \nabla f(u)$, and $e_{x^*}(u, v) := [\nabla^2 f(x^*) - D_k](v - u)$. It follows from the optimality conditions (3) and (25) that $s^k = \mathcal{P}_{x^*}(S_{x^*}(x^k) + e_{x^*}(x^k, s^k))$ and $x^* = \mathcal{P}_{x^*}(S_{x^*}(x^*))$. By Lemma 5 and the triangle inequality, we obtain:

$$\|s^k - x^*\|_{x^*} \le \|S_{x^*}(x^k) - S_{x^*}(x^*)\|_{x^*}^* + \|e_{x^*}(x^k, s^k)\|_{x^*}^*. \tag{30}$$

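Let $\tilde{r}_k := M_f \|x^k - x^*\|_2$. We first bound the term $\|S_{x^*}(x^k) - S_{x^*}(x^*)\|_{x^*}^*$ as follows:

$$\|S_{x^*}(x^k) - S_{x^*}(x^*)\|_{x^*}^* \le \frac{e^{\tilde{r}_k} - \tilde{r}_k - 1}{\tilde{r}_k} \|x^k - x^*\|_{x^*}. \tag{31}$$

Indeed, let us write, for notational convenience, $G_k := \nabla^2 f(x^* + t(x^k - x^*)) - \nabla^2 f(x^*)$ and $H_k := \nabla^2 f(x^*)^{-1/2} G_k \nabla^2 f(x^*)^{-1/2}$, both regarded as functions of $t \in [0,1]$. By the definition of $S_{x^*}$, we have:

$$S_{x^*}(x^k) - S_{x^*}(x^*) = -\int_0^1 G_k (x^k - x^*)\, dt.$$

By applying the bound (20) at the points $x^*$ and $x^* + t(x^k - x^*)$, whose distance parameter is $t\tilde{r}_k$, we get:

$$(e^{-t\tilde{r}_k} - 1) \nabla^2 f(x^*) \preceq G_k \preceq (e^{t\tilde{r}_k} - 1) \nabla^2 f(x^*), \tag{32}$$

which implies:

$$\left\|\int_0^1 H_k\, dt\right\|_2 \le \max\left\{1 - \frac{1 - e^{-\tilde{r}_k}}{\tilde{r}_k},\ \frac{e^{\tilde{r}_k} - 1}{\tilde{r}_k} - 1\right\} = \frac{e^{\tilde{r}_k} - \tilde{r}_k - 1}{\tilde{r}_k}. \tag{33}$$

From the integral representation above, we can easily show that:

$$\|S_{x^*}(x^k) - S_{x^*}(x^*)\|_{x^*}^* \le \left\|\int_0^1 H_k\, dt\right\|_2 \|x^k - x^*\|_{x^*}, \tag{34}$$

which is combined with (33) to obtain (31).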

Next, we bound the second term $\|e_{x^*}(x^k, s^k)\|_{x^*}^*$ of (30) as follows:

$$\|e_{x^*}(x^k, s^k)\|_{x^*}^* \le \rho_* \left(\|s^k - x^*\|_{x^*} + \|x^k - x^*\|_{x^*}\right), \tag{35}$$

where $\rho_* := \max\{L_k/\sigma_{\min}^* - 1,\ 1 - L_k/\sigma_{\max}^*\}$. Indeed, let us define the matrix:

$$\tilde{H}_* := \nabla^2 f(x^*)^{-1/2} [\nabla^2 f(x^*) - D_k] \nabla^2 f(x^*)^{-1/2} = I - \nabla^2 f(x^*)^{-1/2} D_k \nabla^2 f(x^*)^{-1/2}.$$

Then $\rho_*$ is the largest singular value of $\tilde{H}_*$, and:

$$\|e_{x^*}(x^k, s^k)\|_{x^*}^* \le \rho_* \|s^k - x^k\|_{x^*} \le \rho_* \left(\|s^k - x^*\|_{x^*} + \|x^k - x^*\|_{x^*}\right),$$

which proves (35).

Finally, suppose that $\rho_* \in (0,1)$. By (29), (30), (31), and (35), we have the upper bound $\|x^{k+1} - x^*\|_{x^*} \le \tilde{\gamma}_k \|x^k - x^*\|_{x^*}$, where:

$$\tilde{\gamma}_k := 1 - \alpha_k + \alpha_k \left[\frac{e^{\tilde{r}_k} - \tilde{r}_k - 1}{(1 - \rho_*)\tilde{r}_k} + \frac{\rho_*}{1 - \rho_*}\right].$$

Therefore, with a proper choice of $L_k$ such that $\rho_* \in [0, 1/2)$, and for $\tilde{r}_k$ sufficiently small such that $\tilde{\gamma}_k < 1$, the sequence $\{x^k\}_{k \ge 0}$ generated by Algorithm 1 converges linearly to $x^*$. □
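The role of the threshold $\rho_* < 1/2$ can be seen by evaluating the contraction factor directly. The sketch below assumes the factor obtained by substituting (31) and (35) into (30), solving for $\|s^k - x^*\|_{x^*}$, and inserting the result into (29); the function name and the sample values are hypothetical.

```python
import math


def contraction_factor(alpha, r_tilde, rho):
    """1 - alpha + alpha * (omega_bar + rho) / (1 - rho), for rho in [0, 1).

    omega_bar = (e^r - r - 1) / r is the coefficient appearing in (31).
    """
    omega_bar = (math.expm1(r_tilde) - r_tilde) / r_tilde
    return 1.0 - alpha + alpha * (omega_bar + rho) / (1.0 - rho)
```

As $\tilde{r}_k \to 0$ the coefficient `omega_bar` vanishes, so with a full step the factor tends to $\rho_*/(1 - \rho_*)$, which is below one exactly when $\rho_* < 1/2$.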


A.5 Proof of Theorem 2: Local quadratic convergence

Let us define $\mathcal{P}_{x^k}(u) := \mathcal{P}^g_{\nabla^2 f(x^k)}(u)$ with $H_k := \nabla^2 f(x^k)$, $S_{x^k}(u) := \nabla^2 f(x^k)u - \nabla f(u)$, and $e_{x^k}(u, v) := [\nabla^2 f(x^k) - \nabla^2 f(u)](v - u)$. Since $s^k$ is the minimizer of the corresponding convex subproblem, we have $0 \in \nabla f(x^k) + \nabla^2 f(x^k)(s^k - x^k) + \partial g(s^k)$. Using this optimality condition and the condition $\alpha_k = 1$, we can write:

$$x^{k+1} = \mathcal{P}_{x^k}\left(S_{x^k}(x^k) + e_{x^k}(x^k, s^k)\right) = s^k, \qquad s^{k+1} = \mathcal{P}_{x^k}\left(S_{x^k}(x^{k+1}) + e_{x^k}(x^{k+1}, s^{k+1})\right). \tag{36}$$


We define $\tilde{\lambda}_{k+1} := \|s^{k+1} - x^{k+1}\|_{x^k}$. It follows by Lemma 5 and the triangle inequality that:

$$\tilde{\lambda}_{k+1} \le \|S_{x^k}(x^{k+1}) - S_{x^k}(x^k)\|_{x^k}^* + \|e_{x^k}(x^{k+1}, s^{k+1}) - e_{x^k}(x^k, s^k)\|_{x^k}^*. \tag{37}$$


We can prove, in a similar way as in the proof of (31), that the first term of (37) can be upper-bounded as:

$$\|S_{x^k}(x^{k+1}) - S_{x^k}(x^k)\|_{x^k}^* \le \frac{e^{r_k} - r_k - 1}{r_k} \|x^{k+1} - x^k\|_{x^k}. \tag{38}$$

By using (20) and the fact that $e_{x^k}(x^k, s^k) = 0$, the second term of (37) can be upper-bounded as:

$$\|e_{x^k}(x^{k+1}, s^{k+1})\|_{x^k}^* \le \max\{1 - e^{-r_k},\ e^{r_k} - 1\}\, \|s^{k+1} - x^{k+1}\|_{x^k} = (e^{r_k} - 1)\, \tilde{\lambda}_{k+1}. \tag{39}$$

Combining (37), (38), and (39), using $\|x^{k+1} - x^k\|_{x^k} = \lambda_k$, and assuming that $e^{r_k} < 2$, we have:

$$\tilde{\lambda}_{k+1} \le \frac{e^{r_k} - r_k - 1}{(2 - e^{r_k})\, r_k}\, \lambda_k.$$

By using (20), we estimate $\lambda_{k+1}$ as:

$$\lambda_{k+1}^2 = \|s^{k+1} - x^{k+1}\|_{x^{k+1}}^2 \le e^{r_k}\, \tilde{\lambda}_{k+1}^2,$$

and thus, provided that $r_k < \ln(2)$, we obtain an upper bound for $\lambda_{k+1}$ as:

$$\lambda_{k+1} \le \frac{e^{r_k/2} (e^{r_k} - r_k - 1)}{r_k (2 - e^{r_k})}\, \lambda_k.$$


Denote by $\sigma_{\min}^k$ the smallest eigenvalue of $\nabla^2 f(x^k)$. We have $(\sigma_{\min}^{k+1})^{-1} \le e^{r_k} (\sigma_{\min}^k)^{-1}$ by (20), which, combined with the previous estimate, gives us:

$$\frac{\lambda_{k+1}}{\sqrt{\sigma_{\min}^{k+1}}} \le \frac{e^{r_k} (e^{r_k} - r_k - 1)}{\sqrt{\sigma_{\min}^k}\, r_k (2 - e^{r_k})}\, \lambda_k, \tag{40}$$

provided that $r_k < \ln(2)$. Finally, it is easy to check that if $r_k < \ln(4/3) \approx 0.28768207$, then:

$$\frac{e^{r_k} (e^{r_k} - r_k - 1)}{r_k (2 - e^{r_k})} \le 2 r_k.$$

Furthermore, we note that $\lambda_k \ge \sqrt{\sigma_{\min}^k}\, (r_k / M_f)$. Substituting these estimates into (40) we obtain the conclusions of Theorem 2. □
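The scalar inequality that the proof calls "easy to check" can be confirmed numerically on a grid. The sketch below is only such a check, with hypothetical function names and an arbitrary grid size; it evaluates $e^{r}(e^{r} - r - 1)/(r(2 - e^{r}))$ for $0 < r \le \ln(4/3)$ and compares it with $2r$.

```python
import math


def newton_rate_factor(r):
    # e^r (e^r - r - 1) / (r (2 - e^r)), defined for 0 < r < ln 2.
    return math.exp(r) * (math.expm1(r) - r) / (r * (2.0 - math.exp(r)))


def inequality_holds_on_grid(n=1000):
    # Test the bound newton_rate_factor(r) <= 2 r on n points of (0, ln(4/3)].
    r_max = math.log(4.0 / 3.0)
    return all(newton_rate_factor(k * r_max / n) <= 2.0 * (k * r_max / n)
               for k in range(1, n + 1))
```

A grid check does not replace the analytic argument, but it guards against transcription errors in the exponent or in the denominator $2 - e^{r_k}$.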


A.6 Proof of Lemma 4: Self-concordant-like property

The convexity of $f$ is straightforward. We now prove that $f$ is self-concordant-like. Consider the function $\psi(t) := \log\left(\sum_{i=1}^m e^{a_i t + \mu_i}\right)$, where $a := (a_1, \cdots, a_m)$ is a fixed $m$-dimensional real vector. We define the polynomial $P(t; a^k) := a_1^k e^{a_1 t + \mu_1} + \cdots + a_m^k e^{a_m t + \mu_m}$. Then we have $\psi(t) = \log P(t; a^0)$, and for any $k \ge 0$, $P(t; a^k)'_t := \frac{dP(t; a^k)}{dt} = P(t; a^{k+1})$.

Applying the definition of $P(t; a^k)$, it is straightforward to obtain the following expressions:

$$\psi'(t) = \frac{P(t; a^1)}{P(t; a^0)}, \qquad \psi''(t) = \frac{P(t; a^2) P(t; a^0) - P(t; a^1)^2}{P(t; a^0)^2},$$

and

$$\psi'''(t) = \frac{P(t; a^3) P(t; a^0)^2 - 3 P(t; a^2) P(t; a^1) P(t; a^0) + 2 P(t; a^1)^3}{P(t; a^0)^3}. \tag{41}$$

Let us denote $b_i := e^{a_i t + \mu_i}$. Then we can write $\psi''(t)$ as:

$$\psi''(t) = \frac{\sum_{i<j} (a_i - a_j)^2 b_i b_j}{\left(\sum_{i=1}^m b_i\right)^2} \ge 0,$$

and $\psi'''(t)$ as:

$$\psi'''(t) = \frac{\sum_{i<j} (a_i - a_j)^2 b_i b_j \left[\sum_{k=1}^m (a_i + a_j - 2a_k) b_k\right]}{\left(\sum_{i=1}^m b_i\right)^3}. \tag{42}$$


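We note that $|a_i + a_j - 2a_k| \le \sqrt{6}\, \|a\|_2$ for $i, j, k = 1, \dots, m$ by the Cauchy-Schwarz inequality, and $b_k > 0$ for all $k = 1, \dots, m$. Thus we have:

$$\left|\sum_{k=1}^m (a_i + a_j - 2a_k) b_k\right| \le \sqrt{6}\, \|a\|_2 \sum_{k=1}^m b_k.$$

Substituting this inequality into (42) we obtain:

$$|\psi'''(t)| \le \sqrt{6}\, \|a\|_2\, \frac{\sum_{i<j} (a_i - a_j)^2 b_i b_j}{\left(\sum_{i=1}^m b_i\right)^2} = \sqrt{6}\, \|a\|_2\, \psi''(t). \tag{43}$$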

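Let $X$ be a matrix of size $(m+1) \times p$, formed from $X^{(1)}, \dots, X^{(m+1)}$. Now, we define the function $f_j(X) := \log\left(\sum_{i=1}^{m+1} e^{\langle w^{(j)}, X^{(i)}\rangle}\right)$ for $j = 1, \cdots, N$, and consider $\psi_j(t) := f_j(X + td) = \log\left(\sum_{i=1}^{m+1} e^{a_i t + \mu_i}\right)$ for a given point $X$ and direction $d$, where $a_i := \langle w^{(j)}, d^{(i)}\rangle$ and $\mu_i := \langle w^{(j)}, X^{(i)}\rangle$. We note that:

$$\|a\|_2 = \left(\sum_{i=1}^{m+1} a_i^2\right)^{1/2} \le \|w^{(j)}\|_2\, \|d\|_2. \tag{44}$$

By (43) and (44), we obtain:

$$|\psi_j'''(t)| \le \sqrt{6}\, \|a\|_2\, \psi_j''(t) \le \sqrt{6}\, \|w^{(j)}\|_2\, \|d\|_2\, \psi_j''(t).$$

This inequality shows that $f_j$ is $M_j$-self-concordant-like, where $M_j = \sqrt{6}\, \|w^{(j)}\|_2$.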


By the definition of $f$ we have $f(X) = N^{-1}\left[\sum_{j=1}^N f_j(X) - \mathcal{A}(X)\right]$ by keeping the blocks $X^{(i)}$ for $i = 1, \cdots, m$ and restricting $X^{(m+1)} = 0$, where $\mathcal{A}$ is an affine operator. This implies that $f$ is $M_f$-self-concordant-like with the constant $M_f := \sqrt{6}\, N^{-1} \max\{\|w^{(j)}\|_2 : j = 1, \cdots, N\}$. □









